
    T-Crowd: Effective Crowdsourcing for Tabular Data

    Crowdsourcing employs human workers to solve computer-hard problems, such as data cleaning, entity resolution, and sentiment analysis. When crowdsourcing tabular data, e.g., the attribute values of an entity set, a worker's answers on different attributes (e.g., the nationality and age of a celebrity) are often treated independently. This independence assumption does not always hold and can lead to suboptimal crowdsourcing performance. In this paper, we present the T-Crowd system, which takes into consideration the intricate relationships among tasks in order to converge faster to their true values. In particular, T-Crowd integrates each worker's answers on different attributes to effectively learn his/her trustworthiness and the true data values. The attribute relationship information is also used to guide task allocation to workers. Finally, T-Crowd seamlessly supports categorical and continuous attributes, the two main datatypes found in typical databases. Our extensive experiments on real and synthetic datasets show that T-Crowd outperforms state-of-the-art methods in terms of truth inference and reducing the cost of crowdsourcing.
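    The abstract does not include code, but its core alternating idea (estimate true values from trust-weighted answers, then re-estimate each worker's trustworthiness from agreement across all attributes) can be sketched as below. This is a minimal hypothetical illustration, not T-Crowd's actual algorithm; all names and the simple reweighting scheme are assumptions.

```python
# Minimal sketch (not T-Crowd's actual algorithm): jointly estimate worker
# trustworthiness and true values for one categorical and one continuous attribute.
from collections import defaultdict

# answers[(worker, entity)] = (categorical_label, continuous_value)
answers = {
    ("w1", "e1"): ("USA", 45.0), ("w2", "e1"): ("USA", 43.0),
    ("w3", "e1"): ("UK", 70.0), ("w1", "e2"): ("France", 30.0),
    ("w2", "e2"): ("France", 31.0), ("w3", "e2"): ("France", 55.0),
}
workers = {w for w, _ in answers}
entities = {e for _, e in answers}
trust = {w: 1.0 for w in workers}  # start with uniform trustworthiness

for _ in range(10):  # alternate truth estimation and trust update
    truths = {}
    for e in entities:
        votes, total, wsum = defaultdict(float), 0.0, 0.0
        for (w, ent), (label, value) in answers.items():
            if ent != e:
                continue
            votes[label] += trust[w]   # trust-weighted vote (categorical)
            total += trust[w] * value  # trust-weighted mean (continuous)
            wsum += trust[w]
        truths[e] = (max(votes, key=votes.get), total / wsum)
    for w in workers:
        score, n = 0.0, 0
        for (ww, e), (label, value) in answers.items():
            if ww != w:
                continue
            cat_ok = 1.0 if label == truths[e][0] else 0.0
            err = abs(value - truths[e][1])
            score += cat_ok + 1.0 / (1.0 + err)  # combine both attribute types
            n += 1
        trust[w] = score / (2 * n)  # one trust score shared across attributes

print(truths, trust)
```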

    Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads

    LSM-trees are widely adopted as the storage backend of key-value stores. However, optimizing system performance under dynamic workloads has not been sufficiently studied or evaluated in previous work. To fill this gap, we present RusKey, a key-value store with the following new features: (1) RusKey is a first attempt to orchestrate LSM-tree structures online to enable robust performance under dynamic workloads; (2) RusKey is the first study to use Reinforcement Learning (RL) to guide LSM-tree transformations; (3) RusKey includes a new LSM-tree design, named FLSM-tree, for efficient transitions between different compaction policies -- the bottleneck of dynamic key-value stores. We justify the superiority of the new design with theoretical analysis; (4) RusKey requires no prior workload knowledge for system adjustment, in contrast to state-of-the-art techniques. Experiments show that RusKey exhibits strong performance robustness across diverse workloads, achieving up to 4x better end-to-end performance than RocksDB under various settings.
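    As rough intuition for RL-guided structure selection (not RusKey's actual formulation, state space, or reward), the toy sketch below uses tabular Q-learning to pick between leveling and tiering compaction given a coarse workload descriptor; the throughput function is invented.

```python
# Toy sketch of the general idea (not RusKey's actual design): a tabular
# Q-learning agent that picks a compaction policy based on the observed
# read/write mix, rewarded by a simulated throughput.
import random

ACTIONS = ["leveling", "tiering"]        # candidate compaction policies
STATES = ["read_heavy", "write_heavy"]   # coarse workload descriptor
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.1

def throughput(state, action):
    """Stand-in reward: leveling favors reads, tiering favors writes."""
    if state == "read_heavy":
        return 1.0 if action == "leveling" else 0.4
    return 1.0 if action == "tiering" else 0.4

state = random.choice(STATES)
for step in range(1000):
    if random.random() < eps:                      # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    reward = throughput(state, action)
    next_state = random.choice(STATES)             # workload drifts over time
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

# the learned policy maps each workload state to a compaction policy
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES})
```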

    Biological Factor Regulatory Neural Network

    Genes are fundamental for analyzing biological systems, and many recent works have proposed to utilize gene expression for various biological tasks using deep learning models. Despite their promising performance, it is hard for deep neural networks to provide biological insights for humans due to their black-box nature. Recently, some works have integrated biological knowledge with neural networks to improve the transparency and performance of their models. However, these methods can only incorporate partial biological knowledge, leading to suboptimal performance. In this paper, we propose the Biological Factor Regulatory Neural Network (BFReg-NN), a generic framework to model relations among biological factors in cell systems. BFReg-NN starts from gene expression data and is capable of merging most existing biological knowledge into the model, including the regulatory relations among genes or proteins (e.g., gene regulatory networks (GRN), protein-protein interaction networks (PPI)) and the hierarchical relations among genes, proteins and pathways (e.g., several genes/proteins are contained in a pathway). Moreover, BFReg-NN can also provide new biologically meaningful insights because of its white-box characteristics. Experimental results on different gene expression-based tasks verify the superiority of BFReg-NN compared with baselines. Our case studies also show that the key insights found by BFReg-NN are consistent with the biological literature.
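    One common way to merge a regulatory network into a neural model, and a plausible reading of the abstract, is to mask a layer's weights with the network's adjacency matrix so that a gene only influences genes it is known to regulate. The sketch below illustrates that general idea only; it is not BFReg-NN's published architecture, and the 3-gene network is invented.

```python
# Hedged sketch: encode a regulatory network as a sparsity mask on a linear
# layer, so connections without a known regulatory relation are zeroed out.
# This is a generic prior-injection pattern, not BFReg-NN's exact design.
import torch
import torch.nn as nn

class MaskedRegulatoryLayer(nn.Module):
    def __init__(self, adjacency: torch.Tensor):
        super().__init__()
        n = adjacency.shape[0]
        self.weight = nn.Parameter(torch.randn(n, n) * 0.01)
        self.bias = nn.Parameter(torch.zeros(n))
        self.register_buffer("mask", adjacency.float())  # fixed prior knowledge

    def forward(self, x):
        # zero out weights for gene pairs with no known regulatory relation
        return x @ (self.weight * self.mask).T + self.bias

# toy 3-gene network: gene0 -> gene1 -> gene2 (plus self-loops)
adj = torch.tensor([[1, 0, 0], [1, 1, 0], [0, 1, 1]])
layer = MaskedRegulatoryLayer(adj)
expr = torch.randn(4, 3)  # batch of 4 expression profiles
print(layer(expr).shape)  # torch.Size([4, 3])
```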

    Label Propagation for Graph Label Noise

    Label noise is a common challenge in large datasets, as it can significantly degrade the generalization ability of deep neural networks. Most existing studies focus on noisy labels in computer vision; however, graph models take both node features and graph topology as input, and become more susceptible to label noise through message-passing mechanisms. Recently, only a few works have been proposed to tackle label noise on graphs. One major limitation is that they assume the graph is homophilous and the labels are smoothly distributed. Nevertheless, real-world graphs may contain varying degrees of heterophily, or even be heterophily-dominated, leading to the inadequacy of current methods. In this paper, we study graph label noise in the context of arbitrary heterophily, with the aim of rectifying noisy labels and assigning labels to previously unlabeled nodes. We begin by conducting two empirical analyses to explore the impact of graph homophily on graph label noise. Following these observations, we propose a simple yet efficient algorithm, denoted LP4GLN. Specifically, LP4GLN is an iterative algorithm with three steps: (1) reconstruct the graph to recover the homophily property; (2) utilize label propagation to rectify the noisy labels; (3) select high-confidence labels to retain for the next iteration. By iterating these steps, we obtain a set of correct labels, ultimately achieving high accuracy in the node classification task. A theoretical analysis is also provided to demonstrate its remarkable denoising effect. Finally, we conduct experiments on 10 benchmark datasets under varying graph heterophily levels and noise types, comparing the performance of LP4GLN with 7 typical baselines. Our results illustrate the superior performance of the proposed LP4GLN.
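    The three-step loop can be sketched as follows. Step (1) is simplified here to rewiring edges by feature similarity as a homophily proxy; the paper's actual reconstruction and its theory are not reproduced, and all names and thresholds are illustrative assumptions.

```python
# Hedged sketch of the three-step loop the abstract describes; the homophily
# reconstruction is simplified to a kNN rewiring over node features.
import numpy as np

def rewire_by_similarity(X, k=3):
    """Step 1: rebuild edges between the k most similar nodes (homophily proxy)."""
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)
    A = np.zeros_like(sim)
    for i in range(len(X)):
        A[i, np.argsort(sim[i])[-k:]] = 1.0
    return np.maximum(A, A.T)  # symmetrize

def propagate(A, Y, n_iter=20, alpha=0.9):
    """Step 2: standard label propagation over the row-normalized adjacency."""
    D = A.sum(1, keepdims=True).clip(min=1)
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (A / D) @ F + (1 - alpha) * Y
    return F

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))  # node features
Y = np.zeros((30, 2)); Y[:10, 0] = 1; Y[10:20, 1] = 1  # noisy/partial labels

for it in range(3):  # iterate the three steps
    A = rewire_by_similarity(X)
    F = propagate(A, Y)
    conf = F.max(1) / F.sum(1).clip(min=1e-9)
    keep = conf > 0.7  # step 3: retain only high-confidence labels
    Y = np.where(keep[:, None], np.eye(2)[F.argmax(1)], 0.0)

print("labeled nodes after refinement:", int(keep.sum()))
```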

    Truth Inference in Crowdsourcing: Is the Problem Solved?

    Crowdsourcing has emerged as a novel problem-solving paradigm, which facilitates addressing problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers, so a redundancy-based method is widely employed: each task is first assigned to multiple workers, and the correct answer (called the truth) is then inferred from the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database and data mining communities have independently studied this problem and proposed various algorithms. However, these algorithms have not been compared extensively under the same framework, making it hard for practitioners to select appropriate algorithms. To alleviate this problem, we provide a detailed survey of 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all code and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and no algorithm consistently outperforms the others. We believe that the truth inference problem is not fully solved; we identify the limitations of existing algorithms and point out promising research directions.
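    For readers unfamiliar with the setup, the redundancy-based method reduces to: assign each task to several workers, then infer the truth from the redundant answers. Majority voting, sketched below, is the simplest such inference rule; the surveyed algorithms refine it with worker-quality models. The tasks and answers here are invented.

```python
# Minimal illustration of redundancy-based truth inference: each task is
# answered by several workers, and majority voting picks the inferred truth.
from collections import Counter

# task -> answers from the multiple workers the task was assigned to
redundant_answers = {
    "t1": ["positive", "positive", "negative"],
    "t2": ["negative", "negative", "negative"],
    "t3": ["positive", "negative", "positive"],
}

truths = {task: Counter(ans).most_common(1)[0][0]
          for task, ans in redundant_answers.items()}
print(truths)  # {'t1': 'positive', 't2': 'negative', 't3': 'positive'}
```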

    Learning Decomposed Spatial Relations for Multi-Variate Time-Series Modeling

    Modeling multi-variate time-series (MVTS) data is a long-standing research subject with wide applications. Recently, there has been a surge of interest in modeling spatial relations between variables as graphs, i.e., first learning one static graph for each dataset and then exploiting the graph structure via graph neural networks. However, as spatial relations may differ substantially across samples, building one static graph for all samples inherently limits flexibility and severely degrades performance in practice. To address this issue, we propose a framework for fine-grained modeling and utilization of spatial correlations between variables. By analyzing the statistical properties of real-world datasets, we first identify a universal decomposition of spatial correlation graphs: the hidden spatial relations can be decomposed into a prior part, which applies across all samples, and a dynamic part, which varies between samples, and different graphs are needed to model these two kinds of relations. To better coordinate the learning of the two relational graphs, we propose a min-max learning paradigm that not only regulates the common part shared by different dynamic graphs but also guarantees spatial distinguishability among samples. Experimental results show that our proposed model outperforms state-of-the-art baseline methods on both time-series forecasting and time-series point prediction tasks.
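    The decomposition itself can be sketched as a shared (prior) adjacency plus a per-sample (dynamic) adjacency inferred from each sample's features. The sketch below shows only that decomposition; the min-max regularization and the GNN backbone are omitted, and all names and dimensions are illustrative assumptions.

```python
# Hedged sketch of the prior + dynamic graph decomposition the abstract
# describes; not the paper's full model.
import torch
import torch.nn as nn

class DecomposedGraphLearner(nn.Module):
    def __init__(self, n_vars, feat_dim):
        super().__init__()
        self.prior_logits = nn.Parameter(torch.zeros(n_vars, n_vars))  # shared
        self.proj = nn.Linear(feat_dim, 16)  # used to infer per-sample graphs

    def forward(self, x):  # x: (batch, n_vars, feat_dim)
        h = self.proj(x)                                    # (B, N, 16)
        dynamic = torch.softmax(h @ h.transpose(1, 2), -1)  # varies per sample
        prior = torch.softmax(self.prior_logits, -1)        # same for all samples
        return prior.unsqueeze(0) + dynamic                 # combined relations

model = DecomposedGraphLearner(n_vars=5, feat_dim=12)
batch = torch.randn(8, 5, 12)   # 8 samples, 5 series, 12-step windows
print(model(batch).shape)       # torch.Size([8, 5, 5])
```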

    A predictive method to determine incomplete electronic medical records

    This paper uses predictive models to determine missing electronic medical records (EMRs) at general practice offices. Prior research has addressed the missing-values problem in EMRs used for secondary analysis; however, health care providers have overlooked the problem of missing records in the EMRs that store patients' medical visit information. Our study provides a technique to predict the number of EMR entries for each practice based on its past data records. If the number of EMR entries is less than predicted, the method warns of potentially missing records at a 95% confidence level. The study uses seven years of EMRs from 14 general practice offices to train the predictive model, which predicts EMR data entries and accordingly identifies missing EMRs for the following year. We compared the actual visits, as reflected in de-identified billing data, against the model's predictions. The study found that an autocorrelation method improves the identification of missing records by detecting the appropriate prediction period. In addition, artificial neural networks and support vector machines outperform other predictive methods, depending on whether the analysis aims to detect missing EMRs or to identify complete EMRs with no missing records. The results suggest that clinicians and medical professionals should be mindful of potentially missing EMR records prior to any secondary analysis.
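    The flagging rule the abstract describes can be sketched as: predict the expected number of EMR entries from past counts and warn when the observed count falls below the 95% prediction interval. In the sketch below, a simple linear trend stands in for the paper's autocorrelation/ANN/SVM models, and the counts are invented.

```python
# Hedged sketch of the missing-record flag: a linear trend predicts next
# period's EMR entry count, and counts below the approximate 95% lower
# prediction bound trigger a warning. Data and model are stand-ins.
import numpy as np

history = np.array([410, 432, 425, 447, 460, 455, 470, 482, 478, 495, 504, 499])
t = np.arange(len(history))

slope, intercept = np.polyfit(t, history, 1)  # simple trend model
resid = history - (slope * t + intercept)
sigma = resid.std(ddof=2)                     # residual spread

t_next = len(history)
predicted = slope * t_next + intercept
lower = predicted - 1.96 * sigma              # approx. 95% lower bound

observed = 455  # next period's actual EMR entry count
if observed < lower:
    print(f"warning: {observed} entries, below 95% bound {lower:.0f} "
          f"(predicted {predicted:.0f}); records may be missing")
else:
    print(f"{observed} entries is within the expected range")
```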